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Abstract 

Given a probability law 7r on a set S and a function g : S — > R, 
suppose one wants to estimate the mean g = J g dir. The Markov 
Chain Monte Carlo method consists of inventing and simulating a 
Markov chain with stationary distribution n. Typically one has no 
a priori bounds on the chain's mixing time, so even if simulations 
suggest rapid mixing one cannot infer rigorous confidence intervals for 
g. But suppose there is also a separate method which (slowly) gives 
samples exactly from it. From n exact samples, one could immediately 
get a confidence interval of length 0(n~ 1 / 2 ). But one can do better. 
Use each exact sample as the initial state of a Markov chain, and run 
each of these n chains for m steps. We show how to construct con- 
fidence intervals which are always valid, and which, if the (unknown) 
relaxation time of the chain is sufficiently small relative to m/n, have 
length 0(n _1 logn) with high probability. 

Keywords: Confidence interval, Exact sampling, Markov chain Monte 
Carlo. 
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1 Background 



Let 7r be a given probability distribution on a set S. Given a function 
g : S —* R, we want to estimate its mean g := f s g(s)ir(ds). As we learn 
in elementary statistics, one can obtain an estimate for g by taking samples 
from 7T and using the sample average g- value as an estimator. But algo- 
rithms which sample exactly from tt may be prohibitively slow. This is the 
setting for the Markov chain Monte Carlo (MCMC) method, classical in 
statistical physics and over the last ten years studied extensively as statis- 
tical methodology [Q, [5], |9|, |l2[ . In MCMC one designs a Markov chain on 
state-space S to have stationary distribution tt. Then the sample average 
g- value over a long run of the chain is a heuristic estimator of g. Diagnostics 
for assessing length of run required, and expressions for heuristic confidence 



intervals, form a substantial part of MCMC methodology [11]. In general 
one cannot make such estimates rigorous, because one cannot eliminate the 
possibility that all the samples seen come from some small part of the state 
space which is almost disconnected from the remainder. Rigorous estimates 
typically require an a priori bound on some notion of the chain's mixing time 
(e.g. the relaxation time defined at (^)); and while there is now substantial 
theoretical literature on mixing times Jl|, [3], || it deals with settings more 
tractable than most statistical applications. 

This paper investigates the interface between rigor and heuristics in a 
particular (perhaps artificial) context. Suppose we have a guess f for the 
mixing time of the chain, based on simulation diagnostics or heuristic esti- 
mates H or some non-rigorous mathematical argument. Suppose we have 
some separate scheme (an exact sampler) which gives independent samples 
exactly from tt. Imagining that sampling f steps of the chain is roughly 
equivalent to sampling once from ir, it is natural to consider the ratio 

cost of one exact sample 



P 



cost of f steps of the chain 



where cost refers to computational time. If p < 1 then one would just use the 
exact sampler and forget MCMC. If p is extremely large then we might not 
be able to afford even one exact sample, and we are forced to rely on MCMC 
(this is the setting typically envisaged in MCMC). This paper addresses the 
remaining context, where p is large but not extremely large; in other words, 
we can afford to simulate many steps of the chain (enough to make estimates 
heuristically good) but can afford only a few exact samples. In the case of 
sampling from general d-dimensional densities, for instance, exact samplers 
(e.g. based on rejection sampling using some tractable comparison density) 
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are typically feasible only for small d, and MCMC is used for large d, so 
there should always be a window of d-values which fits our "p large, but not 
extremely large" context. 

In this context, we could just use the exact sampler to get n independent 
samples from tt. Then the sample average g- value provides an estimate of 
g with 0(ra~ 1//2 ) error. But instead, suppose we use these n independent 
samples as initial states and generate n independent m-step realizations 
of the Markov chain. If diagnostic tests suggest that mixing occurs in f 
steps then we have an "effective sample size" of (nxm/f) and the heuristic 
estimate of error (when we use the overall sample average g-value as an 
estimator) would be Ofy'f/fnm)). Our main result, Theorem 2.1 , shows 
that in a certain sense such error bounds can be made rigorous. 



2 Results 

The discussion in section [l] provides conceptual context for our result, but let 
us now state the (rather minimal) mathematical assumptions for the result. 
For simplicity we assume the state space S is finite (though since our results 
are non-asymptotic they must extend to the general case without essential 
change). We assume (for reasons explained in Section ^l]) the function 
g : S —* R is bounded, so by rescaling we may assume 

<<?(•)<!• (1) 

We assume the Markov chain is reversible, that is to say its transition matrix 
K satisfies 

TTikij = TTjkji, (2) 

These are the only background assumptions for validity of the conservative 
confidence interval given at (||). That is, there are no "implicit asymptotics" ; 
and we do not even need to assume K is irreducible. The length of the 
confidence interval will depend on the data, i.e. the realizations of the 
chain, but (0) bounds the length in terms of the relaxation time of the 
chain, defined as 

r 2 :=(l-A 2 )- 1 (3) 

where 1 = Ai > A 2 > • • • > A s > —1 are the eigenvalues of K, and r 2 < oo 
if K is irreducible. 

Theorem 2.1 Assume (0,^- Take n > 3, m > 1 and < a < 1 and f > 1. 

Based on 2n exact samples from ir and 2mn steps of the K-chain, we can 
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construct an interval I such that 



and 



P(g?I)<a (4) 

P (length(I) > k a max \ ±, \f^j logn^j 

< 3n(n + 1) exp (- log 2 n \ , (5) 



where k a := 2 ( y2/a + log(4/a) ) . 
2.1 Discussion 



To interpret Theorem 2.1, suppose we take m = nf, so that we use 2n 2 f 
steps of the chain. Then the "target length" (in the left side of (|5|)) of 
our confidence interval will be (9(n _1 log n). The theorem guarantees a 
confidence interval that is always valid, and guarantees that, if t<i is indeed 
not more than f , then the length of the confidence interval will likely not 
exceed the target length. This contrasts with the 0{n~ 1 / 2 ) length confidence 
interval obtained by using the exact sampler only, and gets close to the 
0(n -1 ) length of the heuristic confidence interval. 

Admittedly our numerical error bounds are too crude to be much use in 
practice. For example in order to make the bound in (0) smaller than say 5%, 
one would have to generate at least 300 exact samples, which is most often 
not practicable. So the result is primarily of theoretical interest, in particu- 
lar because of its similarity to the idea Jl(| of self-testing algorithms. That 
paper describes an algorithm for generating random self-avoiding walks. As 
the authors write |l(| "While there are a number of Monte Carlo algorithms 
used to solve these problems in practice, these are heuristic and their cor- 
rectness relies on unproven conjectures. In contrast, our algorithm is shown 
rigorously to produce answers with specific accuracy and confidence. Only 
the efficiency of the algorithm relies on a widely believed conjecture, and 
a novel feature is that this conjecture can be tested as the algorithm pro- 
ceeds." In our MCMC setting, we cannot estimate rigorously the actual 
value of T2, but we can self-justify inferences based on estimated t%. 



On a more technical note, let us outline why Theorem 2.1 gives close 
to the best possible bounds on confidence interval length. Indeed, we claim 
that the best one could hope for is length of order 



1 / T2_\ 



max -,*/ • (6) 

1 n V nm J 



4 



The point is that there are two different "obstacles" to sharp estimation. 
First, consider the eigenvector #2 associated with eigenvalue A2. It is easy 
to estimate the variance of the overall sample average when g = g%, and 
this variance works out as order — ; so we should not hope to have smaller 
estimation error than the corresponding standard deviation */ Second, 
suppose some subset A of the state space, with n(A) = 1/n, is almost dis- 
connected. Then it is not unlikely that all n exact samples, and hence the 
n realizations of the chains, miss A, and so a contribution ^ 7r [5f(-)l^] to g 
would be "invisible" to our simulations, and so this contribution is an un- 
avoidable source of possible error when using sample averages as estimators. 
Our assumption (Q) that g is bounded was intended as the simplest way of 
bounding this error - bounding it as order 1/n. 



So Theorem ^lj shows that, if our initial guess f is indeed roughly close 



to T2, then our rigorous confidence interval's length will be roughly of the 



minimal order (J6|) , up to logn terms. In Section |2.3| we give a natural 
"adaptive" variation in which we prescribe two numbers f < r max , where as 
before f is a heuristic estimate of T2, and where 2n 2 f max is the maximum 
number of steps of the chain that we would be willing to simulate. Theorem 



2.2 gives an always- valid confidence interval which, if T2 is indeed small 
relative to f max , will have length of order n _1 logn and will require order 
n 2 T2 steps of the chain. 

2.2 Outline of construction and proof 

Recall the "procedure" of simulating n realizations of m steps of the Markov 
chain, starting from n exact samples from tt. The construction of the confi- 
dence interval I in Theorem ^l] can be summarized as follows. 

(i) Perform this procedure once, and find the overall average g-value - call 
it g*. 

(ii) Perform the procedure again, and for 1 < i < n let Ai be the average 
g- value over the i'th m-step realization. 

(iii) Test whether \Ai— g* I < , logn for every i, where r(n, m) := min (n, 9) . 

-y/r(n,m) 



If so, report a "short" confidence interval 
if not, report a "long" confidence interval 



ave^i ± O ^max f ~, x > ^ j log /; 
[ave^±0(^)]. 



To analyze the validity of the confidence interval, the key point is that after 
observing the event U \A{ — g*\ < log n " happening n times out of n, we 

y/ fill. Ill) 

can be confident that its probability is 1— 0(1/ n). This allows us to truncate 
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Ai at g* ± , log = , and then the sample average of n truncated variables 



has s.d. of order ° gn x -4= = max ( -, \ — | logn. 

y/r{n,m) V™ \ n ' V nm J 6 

Finally, to bound the chance of not reporting the short confidence inter- 
val we need to bound the chance of a truncation being needed, and a bound 
can be derived from large deviation estimates for reversible chains. 

2.3 An adaptive version 



In the procedure underlying Theorem 2.1 we make a single guess f and 
hope that the "good event" which leads to a short confidence interval will 
happen; if not, we settle for a long confidence interval. A natural variation 
is to specify that, if the "good event" fails, then repeat the process with f 
replaced by 2f , 4f , 8f , . . . and continue until the "good event" happens or 
until we reach some predetermined limit on numbers of steps of the chain. 

Theorem 2.2 Assume (Q,[§/ Take n > 5, < a < 1, and 1 < f < f max = 
2 a f, where a > is an integer. Then based on 2n exact samples from ir, 
and 2n 2 x M steps of the K-chain, where M is a random variable taking 
values in {f , 2f , 2 2 f , . . . , 2 a f} ; we can define an interval I, such that 



and 



and 



P (3 2 1) < a, (7) 



loff n M 
length(I) < k a a if 1 < — < 2 a , (8) 

n t 



P (M > 96 r 2 V f ) < 3n(n + 1) exp (- log 2 n) . (9) 



where k% = 2 y/2(a + l)/a + log(4(o + I) /a) 



So we are prescribing the maximum number of steps of the chain to be 
2n 2 f max . The bound in (||) is less than 0.05 for n = 8 and goes to zero very 
rapidly as n increases. So if f max is indeed large compared to r 2 then by 
generating a small number of exact samples one can construct a confidence 
interval for g which will be "short" with high probability, and the number 
of steps of the Markov chain required will be 0{r?(j2 Vf)). 
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3 Proof of Theorem [27t| 



3.1 Construction of the confidence interval 

Let {Z±, Z|, . . . , Z*} be n independent samples from tt. For 1 < i < n let 
O^O^o be a reversible Markov chain with initial state X* = Z*; these n 
Markov chains are independent. Define 



^ m— 1 



and 

9 



1 n 

: =-EA*- (io) 



n 

i=l 



g* is our initial guess for g. 

Now re-run the entire simulation independently to get {Z\, Z2, ■ ■ ■ , Z n }, 
another set of n independent samples from tt, and (^QjOj^o 1 another in- 
dependent but identically distributed family of n reversible Markov chains 
each with initial state JQo = Z,. We further define 



^ m— 1 

Ai-.= -VVXy), l<i<n, 



j=0 



and 



1 n 

A:=-J2^- (11) 



n . 



Truncate each to get 



f if 1^-5*1 < }°f n 

Ai := I (12) 
I 5* otherwise 

where r(n, m) : = min (n, tt) . Let 

i=l 

Write N n = Yli=l ^i.A% 7^ Ai) for the number of truncations, and call the 
event G n := [N n = 0] the good event. 
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Define h : (N U {0}) X N X (0, 1) -> [0, oo) by 
h(z, n; a) := < 



- + ^= if ^0 

if 2 = 



(14) 



where 



Cry 



1 , , , 2 

and da := log — . 

2a a 



In the next section we shall prove 
Proposition 3.1 For any b > 



> 6 max ( \,\]^ ) logn + h(N n ,n;a) 



< ¥ +a. (15) 



Replacing a by a/2 in Proposition 3J. and setting & = A/2/a, we see that 
the confidence interval 



(16) 



satisfies the requirement of (|j) that P(g ^ /) < a. If iV n = then the 
length of this confidence interval is 



/ :=A±\ W-max ± J log n + h(N n , n; a/2) 



max 



n 1 V WW 



logra + 



d 



a/2 



n 



(17) 



Notice that ( |i~7| ) is bounded by /cq, max \J~^^\ logn; here we use as- 
sumption n > 3 which implies logn > 1. So to prove (||) and complete the 
proof of Theorem 2.1, it is enough to prove (in section 3.3) 

Proposition 3.2 



P(iV n > 0) < 3n(n+ 1) exp 



m 2 , 
log n . 



48nr 2 
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3.2 Proof of Proposition |3.1 



We denote the conditional probability, conditional expectation and condi- 
tional variance given g* by , E 9 and Var 9 respectively. 

Observe that under P# , the random variables A%, A2, ■ ■ ■ ,A n are i.i.d. 

Thus ES*(1) = E 5 *(Ii), and Var r (l) = ±Var r (Ii). But Var r (i x ) = 

2 



Var 9 {Ax-g*)< , ^— 

\ V r(n,m) 

shev's inequality, we get 



log n 



, because \A\ — g*\ < 



log n 
■v/ r(n,m) 



So by Cheby- 



p9 



A -E» (^i)| >6max i ^ logn 



1 



and by taking expectation 



A-W (A 1 )\>bwax log 



1 



(18) 



Now we want to estimate |E 9 (^4i) — g\. From the definitions (11) and 
(0), 

1 n 1 n 

(A - A) = - J2 A A A i + A i) ~ ~ J2 9*I(A ± Ai). 

i=l i=l 
Since g takes values in [0, 1], 



r? 



A^n < [A- A) < -N n 



n 



(19) 



Now A is independent of g* , thus taking conditional expectation given g* in 
© we get 



ES (A^-g 



< 



Vr, 



(20) 



where p n (g*) := P 9 * M>li - <7*| > log n/y/r(n, m)j . 

Under P 9 * we have N n ~ Binomial (n,p n (g*)), and hence 



P 9 Pn(5*)>— ,^n = 

n 



1-Pn(fl*)) l( Pn (r)>rf Q /n) 



< 1 



7 \ n 



/1 



1 



(d a /n<l) 



< e~ d< * since (1 - x) < e~ x V < x < 1 
= — by definition of d a . (21) 
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Further, 



P r [Pn(9*)> — + -^,N n >Q) (22) 



< P r(^L + ^ <Pn(r 



< P r (|iV n -np n (g*)| > yfcc a ) 

p n (g*)(l — p n (g*)) 

< — k — by Chebyshev's inequality 

1 

a 

= — by definition of c a . (23) 
Taking expectations of the conditional probabilities in ( |2~l| ) and ( |23| ) we get 

pU*)>^ n =0)<^ 

and 



Thus by definition of h(-) 

v(p n (g*)>h(N n ,n;af) <a. (24) 
And hence from (|l8|), @) and (]24|) we get 



P (\ A -g\ > b max logn + h(N n , n; a) 

< p(|A-E(ii|r)| >6nia X (i^j logn") 
+ Pfp n (5*) > h(N n ,n;a) 



□ 
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3.3 Proof of Proposition 

Clearly 

P(N n >0) < nP\\A 1 -g*\> 
p(\A 1 -g\> 



log n 



< n 



log n 



2y/r(n, m) 



+ P 



9\ > 



log n 



2vM 



n.m) 



(25) 



To bound the terms of (p5|) we use a large deviation bound for sample 
averages of reversible Markov chains. Lezaud || equation (2) gives a one- 
sided bound for A\ = rrT 1 ^2^=0 

p(^-^>A)<exp(i--^), A>0. 

Since r 2 > 1/2 always, and 2e 2//5 < 3, we deduce the two-sided bound 

A 2 m" 



P^i - g\ > A) < 3exp 
So in particular 

P \\Ai-g\ > 



12r 2 



log n 



2y / r(n, m) 



< 3 exp 



< 3 exp 



A > 0. 



n 



(26) 



48?iT2 r(n, m 



log 2 n 



in 



48nr 2 



log n 



(27) 



Also, for A > 0, 



V{\?-g\>\) < nP(\A$-g\>\) 
= nP(\A 1 -g\>\) 
X 2 m 



So in particular 

P| \g*-g\ > 



log n 



2yM 



n, m) 



< 3n exp 



< 3n exp 



< 3n exp 



12r 2 



m 



n 



48nT2 r(n, m) 



log n 



48nT 2 



log 2 n 



(28) 
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Substituting fl27| ) and pq ) into J2q ) gives the bound asserted in Proposition 

□ 



3.2 



4 Proof of Theorem 12.2 



For the first part of the procedure for constructing the confidence interval I, 
simulate {Z*, 1 < i < n} and {Z{, 1 < i < n} as at the start of section [O . 
This part of the procedure is not repeated. Then for /c G {0, 1,2,... , a}, let 
\1 < i < n, 1 < j < m-fc := 2 fc nf } be realizations of chains started at 
X* Q = Z* ; and let {Xij | 1 < i < n, 1 < j < := 2 fc nf} be realizations of 
chains started at JQo = each simulated until time := 2 k nt. Repeat 

rn (fc) ~(k) — W 

definitions from Section |3|: for /c = 0, 1, ... a define ^ , , and A as 

Aj, Aj, and A respectively with m = m^. Note that for such m we have 
r(n, m) = n. 

' Let = YJU ^ A (fe) )- Define 



/ (n, /c; a) 




a n 



+ h(N^,n;a/2) 



(29) 



where h(-) is as defined in (14). This is the interval / defined at (|16| ) which 



featured in Theorem |2.1| , and so by (Q) we get that for < k < a, < a < 1, 
and n > 5, 

P(<7 £ I(n,k;a)) < a. (30) 

Define T := min{ < k < a | iVn = }, where we write T = a if the set 
is empty. Then define 

M := 2 T f £ {f,2f,2 2 f,. . . ,2 a f} 



I := l(n,T;a/(a + 1 



If < T < a then N%P = 0, and hence from (||) and (|l4|) we get that 



length(I) < k a J° gn 



n 



(31) 
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where = 2 \^j2(a + l)/a + log(4(a + l)/a)J , so (|) is satisfied. Further, 

a 

p (s^) = E p (^ J ' r = fc ) 

fc=0 
a 

= ^P(^/(n^;«/(a + l)),T = ^ 

fc=0 
a 

< ^P(^/(n,i;a/(a + l))' 

k=0 

a 

Ea 
, a+ 1 



< 



a. 



fc=o 



So ([?]) is satisfied also. 

Now to complete the proof we observe that, writing 96 f VT2 for 96(f VT2), 



P I T > 



log; 



96 r 2 V f 



< P AT, 



/ 1 1 96 to Vt i v 

r (U°g 2 — f — J) 



>0 , 



P(M > 96 f Vr 2 ) 



where [xj denotes the greatest integer less than or equal to x. 

. 1 1 96 t 2 Vt I ^ 

Applying Proposition 3.2 with m = n x 2 L ° g2 f Jf, we get 



(32) 



P ( A^ Llog2 ^ J) >0j < 3n(n + l)expl-- — -log 2 n 



< 3n(n + 1) exp (— log 2 n) . 



(33) 



The bound asserted in (||) follows from (|3^) and (|33|). The number of chain 
steps used equals 2n 2 x 2 T f = 2n 2 x M. 

□ 
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